Exploration server for computer science research in Lorraine

Warning: this site is under development.
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Creating word-level language models for large-vocabulary handwriting recognition

Internal identifier: 007B68 (Main/Exploration); previous: 007B67; next: 007B69

Authors: John F. Pitrelli [United States]; Amit Roy [United States]

Source: International journal on document analysis and recognition (Print), 2003, ISSN 1433-2833

RBID : Pascal:03-0386044

French descriptors

English descriptors

Abstract

We discuss development of a word-unigram language model for online handwriting recognition. First, we tokenize a text corpus into words, contrasting with tokenization methods designed for other purposes. Second, we select for our model a subset of the words found, discussing deviations from an N-most-frequent-words approach. From a 600-million-word corpus, we generated a 53,000-word model which eliminates 45% of word-recognition errors made by a character-level-model baseline system. We anticipate that our methods will be applicable to offline recognition as well, and to some extent to other recognizers, such as speech recognizers and video retrieval systems.
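
The two steps named in the abstract (tokenizing a corpus into words, then selecting a vocabulary by frequency) can be illustrated with a minimal Python sketch. It implements only the plain N-most-frequent-words baseline, which the abstract says the authors deviate from; it is not their method. The tokenizer pattern `TOKEN_RE`, the function names, the renormalization over the retained vocabulary, and the toy corpus are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical tokenization rule (an assumption, not the paper's):
# runs of letters, optionally joined by internal apostrophes or hyphens.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")


def tokenize(text):
    """Split raw text into lowercase word tokens."""
    return [m.group(0).lower() for m in TOKEN_RE.finditer(text)]


def build_unigram_model(texts, vocab_size):
    """Keep the vocab_size most frequent words and assign each a
    relative-frequency probability, renormalized over the kept words."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    top = counts.most_common(vocab_size)
    total = sum(count for _, count in top)
    return {word: count / total for word, count in top}


# Toy corpus (illustrative only).
corpus = ["The cat sat on the mat.", "The dog didn't sit; the cat did."]
model = build_unigram_model(corpus, vocab_size=2)
for word, prob in model.items():
    print(f"{word}\t{prob:.3f}")
```

A recognizer could use such a table to rescore candidate words; how the probabilities are combined with the handwriting scores is outside the scope of this sketch.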


Affiliations: IBM T.J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY 10598, USA (both authors)



The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">Creating word-level language models for large-vocabulary handwriting recognition</title>
<author>
<name sortKey="Pitrelli, John F" sort="Pitrelli, John F" uniqKey="Pitrelli J" first="John F." last="Pitrelli">John F. Pitrelli</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>IBM T.J. Watson Research Center, P.O. Box 218</s1>
<s2>Yorktown Heights, NY 10598</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<wicri:noRegion>Yorktown Heights, NY 10598</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Roy, Amit" sort="Roy, Amit" uniqKey="Roy A" first="Amit" last="Roy">Amit Roy</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>IBM T.J. Watson Research Center, P.O. Box 218</s1>
<s2>Yorktown Heights, NY 10598</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<wicri:noRegion>Yorktown Heights, NY 10598</wicri:noRegion>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">INIST</idno>
<idno type="inist">03-0386044</idno>
<date when="2003">2003</date>
<idno type="stanalyst">PASCAL 03-0386044 INIST</idno>
<idno type="RBID">Pascal:03-0386044</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000772</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000271</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000715</idno>
<idno type="wicri:explorRef" wicri:stream="PascalFrancis" wicri:step="Checkpoint">000715</idno>
<idno type="wicri:doubleKey">1433-2833:2003:Pitrelli J:creating:word:level</idno>
<idno type="wicri:Area/Main/Merge">007F74</idno>
<idno type="wicri:Area/Main/Curation">007B68</idno>
<idno type="wicri:Area/Main/Exploration">007B68</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a">Creating word-level language models for large-vocabulary handwriting recognition</title>
<author>
<name sortKey="Pitrelli, John F" sort="Pitrelli, John F" uniqKey="Pitrelli J" first="John F." last="Pitrelli">John F. Pitrelli</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>IBM T.J. Watson Research Center, P.O. Box 218</s1>
<s2>Yorktown Heights, NY 10598</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<wicri:noRegion>Yorktown Heights, NY 10598</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Roy, Amit" sort="Roy, Amit" uniqKey="Roy A" first="Amit" last="Roy">Amit Roy</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>IBM T.J. Watson Research Center, P.O. Box 218</s1>
<s2>Yorktown Heights, NY 10598</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<wicri:noRegion>Yorktown Heights, NY 10598</wicri:noRegion>
</affiliation>
</author>
</analytic>
<series>
<title level="j" type="main">International journal on document analysis and recognition : (Print)</title>
<title level="j" type="abbreviated">Int. j. doc. anal. recognit. : (Print)</title>
<idno type="ISSN">1433-2833</idno>
<imprint>
<date when="2003">2003</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt>
<title level="j" type="main">International journal on document analysis and recognition : (Print)</title>
<title level="j" type="abbreviated">Int. j. doc. anal. recognit. : (Print)</title>
<idno type="ISSN">1433-2833</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Character recognition</term>
<term>Handwriting recognition</term>
<term>Language recognition</term>
<term>Pattern recognition</term>
<term>Syntactic analysis</term>
<term>Token</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr">
<term>Reconnaissance écriture</term>
<term>Reconnaissance forme</term>
<term>Reconnaissance langage</term>
<term>Reconnaissance caractère</term>
<term>Analyse syntaxique</term>
<term>Unigram</term>
<term>Tokenization</term>
<term>Word-level language model</term>
<term>Jeton</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">We discuss development of a word-unigram language model for online handwriting recognition. First, we tokenize a text corpus into words, contrasting with tokenization methods designed for other purposes. Second, we select for our model a subset of the words found, discussing deviations from an N-most-frequent-words approach. From a 600-million-word corpus, we generated a 53,000-word model which eliminates 45% of word-recognition errors made by a character-level-model baseline system. We anticipate that our methods will be applicable to offline recognition as well, and to some extent to other recognizers, such as speech recognizers and video retrieval systems.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>États-Unis</li>
</country>
</list>
<tree>
<country name="États-Unis">
<noRegion>
<name sortKey="Pitrelli, John F" sort="Pitrelli, John F" uniqKey="Pitrelli J" first="John F." last="Pitrelli">John F. Pitrelli</name>
</noRegion>
<name sortKey="Roy, Amit" sort="Roy, Amit" uniqKey="Roy A" first="Amit" last="Roy">Amit Roy</name>
</country>
</tree>
</affiliations>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Lorraine/explor/InforLorV4/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 007B68 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 007B68 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Wicri/Lorraine
   |area=    InforLorV4
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     Pascal:03-0386044
   |texte=   Creating word-level language models for large-vocabulary handwriting recognition
}}

This area was generated with Dilib version V0.6.33.
Data generation: Mon Jun 10 21:56:28 2019. Site generation: Fri Feb 25 15:29:27 2022